Location Sales Estimate Problem¶

Background¶

  • Expansion to Egypt: Company X is expanding into Egypt.
  • Recommended Locations: 2670 potential locations identified.
    • Location A: 2536 locations, $25,000 operational costs, with sales data.
    • Location B: 134 locations, $15,000 operational costs, without sales data.

title

Challenge¶

  • Feasibility: Opening all recommended locations is not feasible.
  • Objective: Select optimal locations for best coverage and sales.

Assumption¶

  • Sales Coverage: A branch covers sales of any other branch within 5 km.
    • Example:
      • Location L1: $10,000 sales.
      • Location L2: $5,000 sales, 2 km away.
      • Selecting one results in combined sales of $15,000.
    • Type A to Type B coverage works similarly.

Goal¶

  • Select Minimum Branches: Ensure maximum coverage and sales.
  • Avoid Losses: Estimated sales must exceed operational costs.

Data Overview¶

We received data on 2670 recommended locations, categorized into two types:

Type A: 2536 locations with an operational cost of 25,000 and available estimated sales data.
Type B: 134 locations with an operational cost of 15,000, without specific sales data but can cover nearby Type A locations within a 5 km radius."
In [58]:
dataFrameTypeA.head()
Out[58]:
ID Gov City_Type Latitude Longitude Estimated_Sales
0 1 الشرقية حضر 30.727543 31.793701 72021.026667
1 2 القاهرة حضر 30.235685 31.467567 54690.637778
2 3 المنوفية حضر 30.361963 30.512031 42379.708889
3 4 أسوان حضر 24.977806 32.875934 42377.312222
4 5 مرسى مطروح حضر 31.355745 27.244068 40826.774444
In [59]:
dataFrameTypeB.head()
Out[59]:
ID Gov Latitude Longitude
0 1 الجيزة 29.948291 31.051539
1 2 الجيزة 30.048963 30.974668
2 3 الجيزة 30.061117 30.960385
3 4 الفيوم 29.361996 30.683888
4 5 الفيوم 29.413400 30.862680

Check Nulls¶

In [60]:
print(dataFrameTypeA.isna().sum())
print(dataFrameTypeB.isna().sum())
ID                 0
Gov                0
City_Type          0
Latitude           0
Longitude          0
Estimated_Sales    0
dtype: int64
ID           0
Gov          0
Latitude     0
Longitude    0
dtype: int64

Check Duplicates¶

In [61]:
print(dataFrameTypeA.duplicated().sum())
print(dataFrameTypeB.duplicated().sum())
0
0

Assing cost and type¶

In [62]:
dataFrameTypeA["Type"]='A'
dataFrameTypeB["Type"]='B'
dataFrameTypeA["operational_costs"]=25000
dataFrameTypeB["operational_costs"]=15000

Number of Governments in data¶

In [65]:
len(dataFrameTypeA["Gov"].unique())
Out[65]:
27
In [66]:
len(dataFrameTypeB["Gov"].unique())
Out[66]:
15
In [67]:
len(pd.concat([dataFrameTypeA, dataFrameTypeB], ignore_index=True)["Gov"].unique())
Out[67]:
29

Governments Location Counts¶

In [69]:
plt.figure(figsize=(8, 4))
plt.bar(reshaped_text, gov_counts, color='skyblue')
plt.title(get_display(arabic_reshaper.reshape('توزيع المواقع حسب المحافظة')))
plt.xlabel(get_display(arabic_reshaper.reshape('المحافظة')))
plt.ylabel(get_display(arabic_reshaper.reshape('عدد المواقع')))
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)

Display Locations¶

In [77]:
plotDataFrameLongLatWithType(pd.concat([dataFrameTypeA, dataFrameTypeB], ignore_index=True))

Haversine Distance¶

1_WnHFKvg3ix6ySk-jhJX1LQ.webp

Latitude: This measures how far north or south a location is from the Equator, which is 0 degrees latitude. Latitude values range from 0 degrees at the Equator to 90 degrees north at the North Pole and 90 degrees south at the South Pole.

Longitude: This measures how far east or west a location is from the Prime Meridian, which is 0 degrees longitude,Longitude values range from 0 degrees at the Prime Meridian to 180 degrees east and 180 degrees west.

Lat_Long.gif

why we use haversine distance¶

The Haversine formula is used to calculate the shortest distance between two points on the surface of a sphere, which in this case is the Earth. The distance calculated using the Haversine formula is called the great-circle distance.

Accuracy Over Long Distances: The Haversine formula accounts for the spherical shape of the Earth, making it accurate for long distances, unlike simple Euclidean distance which assumes a flat surface.

images.png

Methodolgy¶

Option 1: Apply Clustering After Computing Estimated Sales

1-Compute Individual Sales Estimates:
For each Type A location, sum the estimated sales of all other Type A locations within 5 km. For Type B locations, assign the highest nearby Type A sales estimate within 5 km.

2-Filter Locations Based on Sales vs. Costs:
Identify which locations (both Type A and Type B) have sales estimates that meet or exceed their respective operational costs.

3-Cluster Filtered Locations:
Apply a clustering algorithm to the filtered list of locations (those that have sales estimates covering operational costs) to group them based on their geographic coordinates with a 5 km distance parameter.

Option 2: Apply Clustering Before Computing Estimated Sales

1-Cluster Locations:
Use a clustering algorithm (DBSCAN, OPTICS, HDBSCAN) to group locations based on their geographic coordinates (latitude and longitude) with a distance parameter set to 5 km. This will help identify clusters of locations that are within 5 km of each other.

2-Compute Combined Sales for Each Cluster:
For each cluster, sum the estimated sales of all locations within that cluster. Assign the combined sales to a representative location (e.g., the centroid of the cluster or the location with the highest sales within the cluster).

3-Evaluate Clusters Against Operational Costs:
Determine if the combined sales for each cluster meet or exceed the operational costs. Select the clusters (or representative locations) that achieve the best coverage and sales.

Option 1¶

Ball Tree¶

Ball Tree Construction Steps:

  1. Select a Random Point.
  2. Find the Farthest Point (Point A).
  3. Find the Farthest Point from Point A (Point B).
  4. Draw a Line Between Point A and Point B.
  5. Calculate Distances to the Line.
  6. Find the Median Point.
  7. Split the Data into Two Balls.
  8. Repeat Recursively.

title

title

title

title

title

title

Copy Version For Ball Tree Algorthim

In [87]:
dataFrameTypeABall=dataFrameTypeA
dataFrameTypeBBall=dataFrameTypeB

Recommendation-A Calculation

In [88]:
indices,distances=ballTreeCall(dataFrameTypeABall,dataFrameTypeABall)
updatedEstimastedSales=[round(sum([dataFrameTypeABall.iloc[idx]["Estimated_Sales"] for idx in indices[i]]),3) for i in range(len(dataFrameTypeABall))]
dataFrameTypeABall["updatedEstimastedSales"]=updatedEstimastedSales

Recommendation-B Calculation

In [89]:
indices,distances=ballTreeCall(dataFrameTypeABall,dataFrameTypeBBall)
estimated_sales=[]
for i in range(len(dataFrameTypeBBall)):
    if len(indices[i])>0:
        estimated_sales.append(round(dataFrameTypeABall.iloc[indices[i][0]]["updatedEstimastedSales"],3))
    else:
        estimated_sales.append(0.0)
dataFrameTypeBBall["updatedEstimastedSales"]=estimated_sales

Combine All Location Togheter To select from them

In [90]:
combinedDfBall = pd.concat([dataFrameTypeABall, dataFrameTypeBBall], ignore_index=True)

Apply Fiter for not take Locations that EstimatedSales<Operation cost

In [91]:
displayRemovedLocations(combinedDfBall)

Number of removed locations¶

In [92]:
len(combinedDfBall[combinedDfBall["updatedEstimastedSales"]<combinedDfBall["operational_costs"]])
Out[92]:
2024

Remove Location That their costs greater than their sales

In [93]:
combinedDfBall=combinedDfBall[combinedDfBall["updatedEstimastedSales"]>combinedDfBall["operational_costs"]]

Create Column For Coverage

In [94]:
combinedDfBall["Coverage"]=combinedDfBall["updatedEstimastedSales"]-combinedDfBall["operational_costs"]
In [95]:
combinedDfBall
Out[95]:
ID Gov City_Type Latitude Longitude Estimated_Sales Type operational_costs updatedEstimastedSales Coverage
0 1 الشرقية حضر 30.727543 31.793701 72021.026667 A 25000 96084.883 71084.883
1 2 القاهرة حضر 30.235685 31.467567 54690.637778 A 25000 57434.150 32434.150
2 3 المنوفية حضر 30.361963 30.512031 42379.708889 A 25000 51029.819 26029.819
3 4 أسوان حضر 24.977806 32.875934 42377.312222 A 25000 63934.114 38934.114
4 5 مرسى مطروح حضر 31.355745 27.244068 40826.774444 A 25000 68756.226 43756.226
... ... ... ... ... ... ... ... ... ... ...
2663 128 القليوبية NaN 30.145500 31.215355 NaN B 15000 47367.362 32367.362
2664 129 القليوبية NaN 30.318393 31.310492 NaN B 15000 109193.996 94193.996
2665 130 القليوبية NaN 30.186000 31.353272 NaN B 15000 30255.746 15255.746
2666 131 القليوبية NaN 30.138000 31.220216 NaN B 15000 47367.362 32367.362
2667 132 الاسماعيلية NaN 30.852917 32.306078 NaN B 15000 28021.987 13021.987

646 rows × 10 columns

Display Plots with the points contains distance 5km between each other

In [97]:
plotDataFrameLongLatWithType(combinedDfBall,showdistance=True)

Conclusion¶

We need to apply clustring algorthim or apply anyy algorthim that takes the hieghst coverage withing 5km range

Filter Algorthim¶

Select the points that have heighst coverage and be sure that no point within it in 5km range so we can filter based on coverage and range

title

In [99]:
selectedPoints = combinedDfBrute.sort_values('Coverage', ascending=False).reset_index(drop=True)
indicesWithinDistance=filterRowsWithinDistance(selectedPoints)
resultDfBrute = selectedPoints.loc[indicesWithinDistance].reset_index(drop=True)
resultDfBrute
Out[99]:
ID Gov City_Type Latitude Longitude Estimated_Sales Type operational_costs updatedEstimastedSales Coverage
0 129 القليوبية NaN 30.318393 31.310492 NaN B 15000 109193.996167 94193.996167
1 59 الشرقية NaN 30.723517 31.793006 NaN B 15000 96210.856667 81210.856667
2 116 القليوبية ريف 30.155107 31.192262 12431.234444 A 25000 105264.344444 80264.344444
3 1946 القليوبية ريف 30.284461 31.347094 168.683333 A 25000 94386.237344 69386.237344
4 343 القليوبية ريف 30.217819 31.209151 4702.623333 A 25000 85800.951111 60800.951111
... ... ... ... ... ... ... ... ... ... ...
132 643 الشرقية ريف 30.534998 31.717706 1864.231111 A 25000 25760.224444 760.224444
133 695 قنا ريف 26.103429 32.439087 1614.741111 A 25000 25749.605556 749.605556
134 1593 كفر الشيخ ريف 31.280505 31.165862 305.967778 A 25000 25678.984444 678.984444
135 452 المنوفية ريف 30.516160 30.883343 3224.993333 A 25000 25649.080000 649.080000
136 1982 اسيوط ريف 27.535805 30.793993 158.595556 A 25000 25604.987778 604.987778

137 rows × 10 columns

BriefSummary

In [100]:
print(f"The Total Number of Location Selected:{len(resultDfBrute)}")
print(f"The Total Number of A-Location Selected:{len(resultDfBrute[resultDfBrute['Type']=='A'])}")
print(f"The Total Number of B-Location Selected:{len(resultDfBrute[resultDfBrute['Type']=='B'])}")
print(f"The Total Number of Locations Covered by selected:{len(combinedDfBrute)}")
print(f"The Total Sales Equals:{resultDfBrute['updatedEstimastedSales'].sum()}")
print(f"The Total Coverage Equals:{resultDfBrute['Coverage'].sum()}")
print(f"The Total Costs Equals:{resultDfBrute['operational_costs'].sum()}")
The Total Number of Location Selected:137
The Total Number of A-Location Selected:94
The Total Number of B-Location Selected:43
The Total Number of Locations Covered by selected:647
The Total Sales Equals:5601378.195133333
The Total Coverage Equals:2606378.1951333336
The Total Costs Equals:2995000

DisplayResults

In [101]:
plotDataFrameLongLatWithType(resultDfBrute,showdistance=True)
In [102]:
selectedPoints = combinedDfBall.sort_values('Coverage', ascending=False).reset_index(drop=True)
indicesWithinDistance=filterRowsWithinDistance(selectedPoints)
resultDfBall = selectedPoints.loc[indicesWithinDistance].reset_index(drop=True)
resultDfBall
Out[102]:
ID Gov City_Type Latitude Longitude Estimated_Sales Type operational_costs updatedEstimastedSales Coverage
0 129 القليوبية NaN 30.318393 31.310492 NaN B 15000 109193.996 94193.996
1 59 الشرقية NaN 30.723517 31.793006 NaN B 15000 96210.857 81210.857
2 116 القليوبية ريف 30.155107 31.192262 12431.234444 A 25000 105264.344 80264.344
3 1946 القليوبية ريف 30.284461 31.347094 168.683333 A 25000 94386.237 69386.237
4 343 القليوبية ريف 30.217819 31.209151 4702.623333 A 25000 85800.951 60800.951
... ... ... ... ... ... ... ... ... ... ...
131 643 الشرقية ريف 30.534998 31.717706 1864.231111 A 25000 25760.224 760.224
132 695 قنا ريف 26.103429 32.439087 1614.741111 A 25000 25749.606 749.606
133 1593 كفر الشيخ ريف 31.280505 31.165862 305.967778 A 25000 25678.984 678.984
134 452 المنوفية ريف 30.516160 30.883343 3224.993333 A 25000 25649.080 649.080
135 1982 اسيوط ريف 27.535805 30.793993 158.595556 A 25000 25604.988 604.988

136 rows × 10 columns

BriefSummary

In [103]:
print(f"The Total Number of Location Selected:{len(resultDfBall)}")
print(f"The Total Number of A-Location Selected:{len(resultDfBall[resultDfBall['Type']=='A'])}")
print(f"The Total Number of B-Location Selected:{len(resultDfBall[resultDfBall['Type']=='B'])}")
print(f"The Total Number of Locations Covered by selected:{len(combinedDfBall)}")
print(f"The Total Sales Equals:{resultDfBall['updatedEstimastedSales'].sum()}")
print(f"The Total Coverage Equals:{resultDfBall['Coverage'].sum()}")
print(f"The Total Costs Equals:{resultDfBall['operational_costs'].sum()}")
The Total Number of Location Selected:136
The Total Number of A-Location Selected:94
The Total Number of B-Location Selected:42
The Total Number of Locations Covered by selected:646
The Total Sales Equals:5579939.26
The Total Coverage Equals:2599939.2600000002
The Total Costs Equals:2980000
In [104]:
plotDataFrameLongLatWithType(resultDfBall,showdistance=True)

Clustring Algorthims¶

DBSCAN

DBSCAN is a different type of clustering algorithm with some unique advantages. As the name indicates, this method focuses more on the proximity and density of observations to form clusters

dbscan.png

Apply Clustring After perform Standrliztion and Convert to radians

In [106]:
srd=MinMaxScaler()
db = DBSCAN(eps=6/6371.0, min_samples=1, metric='haversine',algorithm='brute')  
y_db = db.fit_predict(srd.fit_transform(np.radians(combinedDfBallDBSCN[["Longitude","Latitude"]])))
combinedDfBallDBSCN['cluster'] = y_db
plotDataFrameLongLatWithType(combinedDfBallDBSCN,clstr=True)

Select the Point from each cluster based on Distance wihtin same cluster and it's higher coverage amount
For each cluster we take the point that have hiegher coverage amount and wihtin 5km to all points in the cluster

In [107]:
selectedClustePoints = selectPoints(combinedDfBallDBSCN)

Select the point that heighst coverage within 5km range Apply Filter

In [108]:
selectedPoints = selectedClustePoints.sort_values('Coverage', ascending=False).reset_index(drop=True)
indicesWithinDistance=filterRowsWithinDistance(selectedPoints)
resultDfBallClusterDBSCN = selectedPoints.loc[indicesWithinDistance].reset_index(drop=True)
resultDfBallClusterDBSCN
Out[108]:
ID Gov City_Type Latitude Longitude Estimated_Sales Type operational_costs updatedEstimastedSales Coverage cluster
0 129 القليوبية NaN 30.318393 31.310492 NaN B 15000 109193.996 94193.996 20
1 58 الشرقية NaN 30.732944 31.808439 NaN B 15000 96210.857 81210.857 468
2 116 القليوبية ريف 30.155107 31.192262 12431.234444 A 25000 105264.344 80264.344 73
3 1946 القليوبية ريف 30.284461 31.347094 168.683333 A 25000 94386.237 69386.237 374
4 765 الشرقية ريف 30.717965 31.756487 1339.983333 A 25000 91977.783 66977.783 203
... ... ... ... ... ... ... ... ... ... ... ...
133 643 الشرقية ريف 30.534998 31.717706 1864.231111 A 25000 25760.224 760.224 179
134 695 قنا ريف 26.103429 32.439087 1614.741111 A 25000 25749.606 749.606 186
135 1593 كفر الشيخ ريف 31.280505 31.165862 305.967778 A 25000 25678.984 678.984 316
136 452 المنوفية ريف 30.516160 30.883343 3224.993333 A 25000 25649.080 649.080 155
137 1982 اسيوط ريف 27.535805 30.793993 158.595556 A 25000 25604.988 604.988 378

138 rows × 11 columns

BriefSummary

In [109]:
print(f"The Total Number of Location Selected:{len(resultDfBallClusterDBSCN)}")
print(f"The Total Number of A-Location Selected:{len(resultDfBallClusterDBSCN[resultDfBallClusterDBSCN['Type']=='A'])}")
print(f"The Total Number of B-Location Selected:{len(resultDfBallClusterDBSCN[resultDfBallClusterDBSCN['Type']=='B'])}")
print(f"The Total Number of Locations Covered by selected:{len(combinedDfBallDBSCN)}")
print(f"The Total Sales Equals:{resultDfBallClusterDBSCN['updatedEstimastedSales'].sum()}")
print(f"The Total Coverage Equals:{resultDfBallClusterDBSCN['Coverage'].sum()}")
print(f"The Total Costs Equals:{resultDfBallClusterDBSCN['operational_costs'].sum()}")
The Total Number of Location Selected:138
The Total Number of A-Location Selected:96
The Total Number of B-Location Selected:42
The Total Number of Locations Covered by selected:646
The Total Sales Equals:5701418.605
The Total Coverage Equals:2671418.605
The Total Costs Equals:3030000
In [110]:
plotDataFrameLongLatWithType(resultDfBallClusterDBSCN,showdistance=True,showNumber=True)

OPTICS

OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering algorithm, similar to DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Core Distance: It is the minimum value of radius required to classify a given point as a core point. If the given point is not a Core point, then it’s Core Distance is undefined.

Reachability Distance: It is defined with respect to another data point q(Let). The Reachability distance between a point p and q is the maximum of the Core Distance of p and the Euclidean Distance(or some other distance metric) between p and q. Note that The Reachability Distance is not defined if q is not a Core point.

reachability_distance1.png

Cluster extraction:OPTICS produces a reachability distance plot that can be used to extract clusters at different levels of granularity. This allows for more flexible clustering and can reveal clusters that may not be apparent with a fixed epsilon value in DBSCAN. However, this also requires more manual interpretation and decision-making on the part of the programmer.

Noise handling: DBSCAN explicitly distinguishes between core points, boundary points, and noise points, while OPTICS does not explicitly identify noise points. Instead, points with high reachability distances can be considered as potential noise points. However, this also means that OPTICS may be less effective at identifying small clusters that are surrounded by noise points, as these clusters may be merged with the noise points in the reachability distance plot.

Runtime complexity: The runtime complexity of OPTICS is generally higher than that of DBSCAN, due to the use of a priority queue to maintain the reachability distances. However, recent research has proposed optimizations to reduce the computational complexity of OPTICS, making it more scalable for large datasets.

In [112]:
optics = OPTICS(min_samples=2, eps=5, metric='haversine',algorithm="brute")
y_db = optics.fit_predict(srd.fit_transform(np.radians(combinedDfBallOPTICS[["Longitude","Latitude"]])))
combinedDfBallOPTICS['cluster'] = y_db
plotDataFrameLongLatWithType(combinedDfBallOPTICS,clstr=True)
In [113]:
noiseData=combinedDfBallOPTICS[combinedDfBallOPTICS["cluster"]==-1]
startcount=max(combinedDfBallOPTICS[combinedDfBallOPTICS["cluster"]!=-1]["cluster"])
noiseData['cluster'] = range(startcount, (startcount-1)+len(noiseData) + 1)
selectedPoints = selectPoints(pd.concat([combinedDfBallOPTICS[combinedDfBallOPTICS["cluster"]!=-1], noiseData], ignore_index=True))
selectedPoints
Out[113]:
ID Gov City_Type Latitude Longitude Estimated_Sales Type operational_costs updatedEstimastedSales Coverage cluster
0 58 الشرقية NaN 30.732944 31.808439 NaN B 15000 96210.857 81210.857 0
1 1637 الشرقية ريف 30.685039 31.780953 282.131111 A 25000 89609.278 64609.278 1
2 1487 الشرقية ريف 30.731698 31.843480 369.911111 A 25000 103210.140 78210.140 2
3 866 الشرقية ريف 30.721617 31.702214 1066.661111 A 25000 59119.196 34119.196 3
4 63 الشرقية NaN 30.722667 31.664328 NaN B 15000 54942.319 39942.319 4
... ... ... ... ... ... ... ... ... ... ... ...
283 71 المنوفية NaN 30.468696 30.934389 NaN B 15000 44173.571 29173.571 283
284 72 المنوفية NaN 30.460861 30.938333 NaN B 15000 44173.571 29173.571 284
285 81 المنوفية NaN 30.428361 31.034500 NaN B 15000 29824.110 14824.110 285
286 107 الدقهلية NaN 31.093528 31.305411 NaN B 15000 18315.163 3315.163 286
287 111 القليوبية NaN 30.226889 31.365677 NaN B 15000 58432.181 43432.181 287

288 rows × 11 columns

In [114]:
selectedPoints = selectedPoints.sort_values('Coverage', ascending=False).reset_index(drop=True)
indicesWithinDistance=filterRowsWithinDistance(selectedPoints)
resultDfBallClusterOPTICS = selectedPoints.loc[indicesWithinDistance].reset_index(drop=True)
resultDfBallClusterOPTICS
Out[114]:
ID Gov City_Type Latitude Longitude Estimated_Sales Type operational_costs updatedEstimastedSales Coverage cluster
0 129 القليوبية NaN 30.318393 31.310492 NaN B 15000 109193.996 94193.996 61
1 58 الشرقية NaN 30.732944 31.808439 NaN B 15000 96210.857 81210.857 0
2 116 القليوبية ريف 30.155107 31.192262 12431.234444 A 25000 105264.344 80264.344 52
3 1946 القليوبية ريف 30.284461 31.347094 168.683333 A 25000 94386.237 69386.237 64
4 765 الشرقية ريف 30.717965 31.756487 1339.983333 A 25000 91977.783 66977.783 200
... ... ... ... ... ... ... ... ... ... ... ...
116 643 الشرقية ريف 30.534998 31.717706 1864.231111 A 25000 25760.224 760.224 190
117 695 قنا ريف 26.103429 32.439087 1614.741111 A 25000 25749.606 749.606 192
118 452 المنوفية ريف 30.516160 30.883343 3224.993333 A 25000 25649.080 649.080 182
119 1982 اسيوط ريف 27.535805 30.793993 158.595556 A 25000 25604.988 604.988 249
120 67 كفر الشيخ حضر 31.309466 31.150495 16794.541111 A 25000 25348.356 348.356 122

121 rows × 11 columns

BriefSummary

In [115]:
print(f"The Total Number of Location Selected:{len(resultDfBallClusterOPTICS)}")
print(f"The Total Number of A-Location Selected:{len(resultDfBallClusterOPTICS[resultDfBallClusterOPTICS['Type']=='A'])}")
print(f"The Total Number of B-Location Selected:{len(resultDfBallClusterOPTICS[resultDfBallClusterOPTICS['Type']=='B'])}")
print(f"The Total Number of Locations Covered by selected:{len(combinedDfBallOPTICS)}")
print(f"The Total Sales Equals:{resultDfBallClusterOPTICS['updatedEstimastedSales'].sum()}")
print(f"The Total Coverage Equals:{resultDfBallClusterOPTICS['Coverage'].sum()}")
print(f"The Total Costs Equals:{resultDfBallClusterOPTICS['operational_costs'].sum()}")
The Total Number of Location Selected:121
The Total Number of A-Location Selected:82
The Total Number of B-Location Selected:39
The Total Number of Locations Covered by selected:646
The Total Sales Equals:5181672.782999999
The Total Coverage Equals:2546672.7830000003
The Total Costs Equals:2635000

DisplayResults

In [116]:
plotDataFrameLongLatWithType(resultDfBallClusterOPTICS,clstr=False,showdistance=True,showNumber=True)

HDBSCAN

Estimating densities We need some method to estimate the density around certain points. One common way to do this is by using “core distance.” This is the distance of a point to its K-th nearest neighbor.

1_XI359LqPheRAR4me3-jFtg.webp

Points in denser regions would have smaller core distances while points in sparser regions would have larger core distances. Core distance is what makes these methods “density-based”.

1_eRW_IFZeL1aZ8AL26BwArw.webp

Simple Cluster Selection

One way to select clusters is to pick a global threshold. By getting the points with densities above the threshold, and grouping these points together, we get our clusters.

1_txhiQ6wFmrxd0MwCdwU3vQ.webp

In [118]:
hdbscn = hdbscan.HDBSCAN(min_cluster_size=2,min_samples=1, cluster_selection_epsilon=5/6371.0, metric='haversine')
y_db = hdbscn.fit_predict(srd.fit_transform(np.radians(combinedDfBallHDBSCN[["Longitude","Latitude"]])))
combinedDfBallHDBSCN['cluster'] = y_db
In [119]:
noiseData=combinedDfBallHDBSCN[combinedDfBallHDBSCN["cluster"]==-1]
startcount=max(combinedDfBallHDBSCN[combinedDfBallHDBSCN["cluster"]!=-1]["cluster"])
noiseData['cluster'] = range(startcount, (startcount-1)+len(noiseData) + 1)
selectedPoints = selectPoints(pd.concat([combinedDfBallHDBSCN[combinedDfBallHDBSCN["cluster"]!=-1], noiseData], ignore_index=True))
selectedPoints
Out[119]:
ID Gov City_Type Latitude Longitude Estimated_Sales Type operational_costs updatedEstimastedSales Coverage cluster
0 317 مرسى مطروح حضر 31.352412 27.255389 5367.276667 A 25000 84420.660 59420.660 0
1 112 مرسى مطروح حضر 31.327339 27.265570 12714.126667 A 25000 84420.660 59420.660 1
2 1194 جنوب سيناء حضر 27.892750 34.298750 578.293333 A 25000 42917.283 17917.283 2
3 41 البحر الاحمر حضر 28.340011 33.086865 21800.420000 A 25000 29658.218 4658.218 3
4 30 شمال سيناء حضر 31.126209 33.808316 25193.351111 A 25000 27226.639 2226.639 4
... ... ... ... ... ... ... ... ... ... ... ...
233 46 القاهرة NaN 30.164937 31.337363 NaN B 15000 19896.361 4896.361 233
234 51 القاهرة NaN 30.165917 31.605466 NaN B 15000 17900.754 2900.754 234
235 56 كفر الشيخ NaN 31.274111 30.788123 NaN B 15000 18797.744 3797.744 235
236 107 الدقهلية NaN 31.093528 31.305411 NaN B 15000 18315.163 3315.163 236
237 111 القليوبية NaN 30.226889 31.365677 NaN B 15000 58432.181 43432.181 237

238 rows × 11 columns

In [120]:
selectedPoints = selectedPoints.sort_values('Coverage', ascending=False).reset_index(drop=True)
indicesWithinDistance=filterRowsWithinDistance(selectedPoints)
resultDfBallClusterHDBSCN = selectedPoints.loc[indicesWithinDistance].reset_index(drop=True)
resultDfBallClusterHDBSCN
Out[120]:
ID Gov City_Type Latitude Longitude Estimated_Sales Type operational_costs updatedEstimastedSales Coverage cluster
0 129 القليوبية NaN 30.318393 31.310492 NaN B 15000 109193.996 94193.996 146
1 58 الشرقية NaN 30.732944 31.808439 NaN B 15000 96210.857 81210.857 87
2 116 القليوبية ريف 30.155107 31.192262 12431.234444 A 25000 105264.344 80264.344 154
3 1946 القليوبية ريف 30.284461 31.347094 168.683333 A 25000 94386.237 69386.237 140
4 765 الشرقية ريف 30.717965 31.756487 1339.983333 A 25000 91977.783 66977.783 182
... ... ... ... ... ... ... ... ... ... ... ...
114 643 الشرقية ريف 30.534998 31.717706 1864.231111 A 25000 25760.224 760.224 176
115 695 قنا ريف 26.103429 32.439087 1614.741111 A 25000 25749.606 749.606 178
116 452 المنوفية ريف 30.516160 30.883343 3224.993333 A 25000 25649.080 649.080 169
117 1982 اسيوط ريف 27.535805 30.793993 158.595556 A 25000 25604.988 604.988 217
118 67 كفر الشيخ حضر 31.309466 31.150495 16794.541111 A 25000 25348.356 348.356 38

119 rows × 11 columns

DisplayResults

In [121]:
plotDataFrameLongLatWithType(resultDfBallClusterHDBSCN,clstr=False,showdistance=True,showNumber=True)

BriefSummary

In [122]:
print(f"The Total Number of Location Selected:{len(resultDfBallClusterHDBSCN)}")
print(f"The Total Number of A-Location Selected:{len(resultDfBallClusterHDBSCN[resultDfBallClusterHDBSCN['Type']=='A'])}")
print(f"The Total Number of B-Location Selected:{len(resultDfBallClusterHDBSCN[resultDfBallClusterHDBSCN['Type']=='B'])}")
print(f"The Total Number of Locations Covered by selected:{len(combinedDfBallHDBSCN)}")
print(f"The Total Sales Equals:{resultDfBallClusterHDBSCN['updatedEstimastedSales'].sum()}")
print(f"The Total Coverage Equals:{resultDfBallClusterHDBSCN['Coverage'].sum()}")
print(f"The Total Costs Equals:{resultDfBallClusterHDBSCN['operational_costs'].sum()}")
The Total Number of Location Selected:119
The Total Number of A-Location Selected:80
The Total Number of B-Location Selected:39
The Total Number of Locations Covered by selected:646
The Total Sales Equals:5101434.974999999
The Total Coverage Equals:2516434.975000001
The Total Costs Equals:2585000

OptionOneResult

In [124]:
optionOneDFResult
Out[124]:
choices # of-Loc # of-A-Loc # of-B-Loc # of-LocCover TotalSales TotalCoverage TotalCosts
0 HDBSCAN 119 80 39 646 5101434.975 2516434.975 2585000
1 DBSCAN 138 96 42 646 5701418.605 2671418.605 3030000
2 OPTICS 121 82 39 646 5181672.783 2546672.783 2635000
3 BallTree 136 94 42 646 5579939.260 2599939.260 2980000
4 Brute 137 94 43 647 5601378.195 2606378.195 2995000

Option 2¶

In [125]:
combinedDFOption2=pd.concat([dataFrameTypeA, dataFrameTypeB], ignore_index=True)

DBSCAN

In [128]:
srd=MinMaxScaler()
db = DBSCAN(eps=6/6371.0, min_samples=1, metric='haversine',algorithm='brute')  
y_db = db.fit_predict(srd.fit_transform(np.radians(combinedDfDBSCNOPT2[["Longitude","Latitude"]])))
combinedDfDBSCNOPT2['cluster'] = y_db
plotDataFrameLongLatWithType(combinedDfDBSCNOPT2,clstr=True)
In [129]:
combinedDfDBSCNOPT2['updatedEstimastedSales'] = combinedDfDBSCNOPT2.groupby('cluster')['Estimated_Sales'].transform('sum')
In [132]:
combinedDfDBSCNOPT2=combinedDfDBSCNOPT2[combinedDfDBSCNOPT2["updatedEstimastedSales"]>combinedDfDBSCNOPT2["operational_costs"]]
In [133]:
combinedDfDBSCNOPT2
Out[133]:
ID Gov City_Type Latitude Longitude Estimated_Sales Type operational_costs updatedEstimastedSales cluster
0 1 الشرقية حضر 30.727543 31.793701 72021.026667 A 25000 85931.800000 0
1 2 القاهرة حضر 30.235685 31.467567 54690.637778 A 25000 54690.637778 1
2 3 المنوفية حضر 30.361963 30.512031 42379.708889 A 25000 42379.708889 2
3 4 أسوان حضر 24.977806 32.875934 42377.312222 A 25000 56644.472222 3
4 5 مرسى مطروح حضر 31.355745 27.244068 40826.774444 A 25000 41380.507778 4
... ... ... ... ... ... ... ... ... ... ...
2660 125 القليوبية NaN 30.310222 31.323466 NaN B 15000 65058.061789 20
2663 128 القليوبية NaN 30.145500 31.215355 NaN B 15000 31718.676667 12
2664 129 القليوبية NaN 30.318393 31.310492 NaN B 15000 65058.061789 20
2666 131 القليوبية NaN 30.138000 31.220216 NaN B 15000 31718.676667 12
2667 132 الاسماعيلية NaN 30.852917 32.306078 NaN B 15000 16944.621111 60

169 rows × 10 columns

In [134]:
combinedDfDBSCNOPT2["Coverage"]=combinedDfDBSCNOPT2["updatedEstimastedSales"]-combinedDfDBSCNOPT2["operational_costs"]
In [135]:
selectedPoints = selectPoints(combinedDfDBSCNOPT2)
selectedPoints = selectedPoints.sort_values('Coverage', ascending=False).reset_index(drop=True)
indicesWithinDistance=filterRowsWithinDistance(selectedPoints)
resultDfDBSCNOPT2 = selectedPoints.loc[indicesWithinDistance].reset_index(drop=True)
resultDfDBSCNOPT2
Out[135]:
ID Gov City_Type Latitude Longitude Estimated_Sales Type operational_costs updatedEstimastedSales cluster Coverage
0 57 الشرقية NaN 30.733709 31.791613 NaN B 15000 85931.800000 0 70931.800000
1 18 أسوان حضر 24.089784 32.898154 30039.402222 A 25000 83389.173333 17 58389.173333
2 123 القليوبية NaN 30.313824 31.318385 NaN B 15000 65058.061789 20 50058.061789
3 30 المنيا NaN 28.647222 30.843517 NaN B 15000 62077.036667 14 47077.036667
4 33 المنيا NaN 28.316419 30.705637 NaN B 15000 59845.363333 13 44845.363333
5 63 الشرقية NaN 30.722667 31.664328 NaN B 15000 53305.933333 5 38305.933333
6 115 القليوبية NaN 30.180428 31.217045 NaN B 15000 49031.280000 7 34031.280000
7 4 أسوان حضر 24.977806 32.875934 42377.312222 A 25000 56644.472222 3 31644.472222
8 62 الشرقية NaN 30.523139 31.348689 NaN B 15000 45520.860000 10 30520.860000
9 2 القاهرة حضر 30.235685 31.467567 54690.637778 A 25000 54690.637778 1 29690.637778
10 93 البحيرة NaN 30.890535 30.663745 NaN B 15000 41950.432222 8 26950.432222
11 77 المنوفية NaN 30.298833 30.983217 NaN B 15000 37151.773333 39 22151.773333
12 29 المنيا NaN 27.736193 30.834816 NaN B 15000 35794.327778 29 20794.327778
13 105 الدقهلية NaN 31.194083 31.528133 NaN B 15000 34270.371111 11 19270.371111
14 53 الغربية NaN 30.939739 30.806615 NaN B 15000 33141.896667 22 18141.896667
15 3 المنوفية حضر 30.361963 30.512031 42379.708889 A 25000 42379.708889 2 17379.708889
16 5 مرسى مطروح حضر 31.355745 27.244068 40826.774444 A 25000 41380.507778 4 16380.507778
17 7 القليوبية ريف 30.334697 31.268586 40289.102156 A 25000 40289.102156 6 15289.102156
18 102 الدقهلية NaN 31.090019 31.597546 NaN B 15000 28787.932222 23 13787.932222
19 92 البحيرة NaN 30.762731 30.694217 NaN B 15000 27990.341111 38 12990.341111
20 98 البحيرة NaN 30.830361 30.539272 NaN B 15000 27810.555922 49 12810.555922
21 113 القليوبية NaN 30.245453 31.353542 NaN B 15000 27640.241111 78 12640.241111
22 22 اسيوط حضر 27.061354 31.340125 27662.900000 A 25000 37184.572222 21 12184.572222
23 96 البحيرة NaN 31.030694 30.715550 NaN B 15000 26287.638889 30 11287.638889
24 66 الشرقية NaN 30.791750 31.620855 NaN B 15000 26143.969833 24 11143.969833
25 44 بني سويف حضر 28.925050 30.855897 20982.414444 A 25000 36143.598889 40 11143.598889
26 9 الاسكندرية NaN 31.159778 29.869967 NaN B 15000 26005.710000 36 11005.710000
27 10 جنوب سيناء حضر 27.860291 34.303116 34468.973333 A 25000 34843.250000 9 9843.250000
28 61 الشرقية NaN 30.535955 31.677756 NaN B 15000 23483.897778 50 8483.897778
29 55 الغربية NaN 30.729556 31.117917 NaN B 15000 23199.595556 92 8199.595556
30 45 القاهرة NaN 30.469000 30.939133 NaN B 15000 22645.623333 35 7645.623333
31 16 الدقهلية حضر 31.166643 31.803276 30836.737778 A 25000 32433.765556 15 7433.765556
32 4 الفيوم NaN 29.361996 30.683888 NaN B 15000 21962.411111 51 6962.411111
33 52 الغربية NaN 30.824112 30.810771 NaN B 15000 21934.857778 79 6934.857778
34 20 اسيوط حضر 27.443933 30.830140 29299.514444 A 25000 31612.167778 19 6612.167778
35 60 الشرقية NaN 30.750426 31.452409 NaN B 15000 20863.726667 120 5863.726667
36 17 الدقهلية حضر 30.881879 31.454996 30526.830000 A 25000 30526.830000 16 5526.830000
37 34 المنيا NaN 28.082083 30.763189 NaN B 15000 20296.282222 48 5296.282222
38 70 الجيزة ريف 29.775070 31.267214 16106.003333 A 25000 30083.615556 64 5083.615556
39 19 دمياط حضر 31.354685 31.685219 29965.090000 A 25000 29965.090000 18 4965.090000
40 35 المنيا NaN 28.696693 30.784010 NaN B 15000 19644.896667 46 4644.896667
41 68 المنوفية NaN 30.426583 31.044210 NaN B 15000 19487.016667 86 4487.016667
42 6 الفيوم NaN 29.474248 30.953917 NaN B 15000 19153.417833 69 4153.417833
43 100 البحيرة NaN 30.910763 30.173929 NaN B 15000 17994.738889 55 2994.738889
44 51 القاهرة NaN 30.165917 31.605466 NaN B 15000 17900.754444 52 2900.754444
45 84 المنوفية NaN 30.549361 31.036855 NaN B 15000 17693.730000 95 2693.730000
46 38 اسيوط حضر 27.264926 31.154670 22491.631111 A 25000 27147.948889 34 2147.948889
47 132 الاسماعيلية NaN 30.852917 32.306078 NaN B 15000 16944.621111 60 1944.621111
48 30 شمال سيناء حضر 31.126209 33.808316 25193.351111 A 25000 26607.242222 26 1607.242222
49 28 الدقهلية حضر 31.159029 31.936320 25787.155556 A 25000 25787.155556 25 787.155556
50 107 الدقهلية NaN 31.093528 31.305411 NaN B 15000 15635.557778 71 635.557778
51 31 القاهرة حضر 29.847197 31.377750 24386.654444 A 25000 25488.334444 27 488.334444
52 25 بني سويف NaN 29.342944 31.208105 NaN B 15000 15339.275556 214 339.275556

DisplayResults

In [136]:
plotDataFrameLongLatWithType(resultDfDBSCNOPT2,showdistance=True,showNumber=True)

BriefSummary

In [137]:
print(f"The Total Number of Location Selected:{len(resultDfDBSCNOPT2)}")
print(f"The Total Number of A-Location Selected:{len(resultDfDBSCNOPT2[resultDfDBSCNOPT2['Type']=='A'])}")
print(f"The Total Number of B-Location Selected:{len(resultDfDBSCNOPT2[resultDfDBSCNOPT2['Type']=='B'])}")
print(f"The Total Number of Locations Covered by selected:{len(combinedDfDBSCNOPT2)}")
print(f"The Total Sales Equals:{resultDfDBSCNOPT2['updatedEstimastedSales'].sum()}")
print(f"The Total Coverage Equals:{resultDfDBSCNOPT2['Coverage'].sum()}")
print(f"The Total Costs Equals:{resultDfDBSCNOPT2['operational_costs'].sum()}")
The Total Number of Location Selected:53
The Total Number of A-Location Selected:18
The Total Number of B-Location Selected:35
The Total Number of Locations Covered by selected:169
The Total Sales Equals:1794523.101977778
The Total Coverage Equals:819523.1019777778
The Total Costs Equals:975000

OPTICS

In [139]:
optics = OPTICS(min_samples=2, eps=5, metric='haversine',algorithm="brute")
y_db = optics.fit_predict(srd.fit_transform(np.radians(combinedDfOPTICSOPT2[["Longitude","Latitude"]])))
combinedDfOPTICSOPT2['cluster'] = y_db
plotDataFrameLongLatWithType(combinedDfOPTICSOPT2,clstr=True)
In [140]:
combinedDfOPTICSOPT2['updatedEstimastedSales'] = combinedDfOPTICSOPT2.groupby('cluster')['Estimated_Sales'].transform('sum')
In [141]:
displayRemovedLocations(combinedDfOPTICSOPT2)
In [142]:
combinedDfOPTICSOPT2=combinedDfOPTICSOPT2[combinedDfOPTICSOPT2["updatedEstimastedSales"]>combinedDfOPTICSOPT2["operational_costs"]]
In [144]:
combinedDfOPTICSOPT2["Coverage"]=combinedDfOPTICSOPT2["updatedEstimastedSales"]-combinedDfOPTICSOPT2["operational_costs"]
In [145]:
noiseData=combinedDfOPTICSOPT2[combinedDfOPTICSOPT2["cluster"]==-1]
startcount=max(combinedDfOPTICSOPT2[combinedDfOPTICSOPT2["cluster"]!=-1]["cluster"])
noiseData['cluster'] = range(startcount, (startcount-1)+len(noiseData) + 1)
selectedPoints = selectPoints(pd.concat([combinedDfOPTICSOPT2[combinedDfOPTICSOPT2["cluster"]!=-1], noiseData], ignore_index=True))
indicesWithinDistance=filterRowsWithinDistance(selectedPoints)
resultDfOPTICSOPT2 = selectedPoints.loc[indicesWithinDistance].reset_index(drop=True)
resultDfOPTICSOPT2
Out[145]:
ID Gov City_Type Latitude Longitude Estimated_Sales Type operational_costs updatedEstimastedSales cluster Coverage
0 57 الشرقية NaN 30.733709 31.791613 NaN B 15000 85931.800000 0 70931.800000
1 63 الشرقية NaN 30.722667 31.664328 NaN B 15000 53305.933333 10 38305.933333
2 60 الشرقية NaN 30.750426 31.452409 NaN B 15000 22111.735556 36 7111.735556
3 62 الشرقية NaN 30.523139 31.348689 NaN B 15000 46646.990000 51 31646.990000
4 84 المنوفية NaN 30.549361 31.036855 NaN B 15000 17693.730000 99 2693.730000
... ... ... ... ... ... ... ... ... ... ... ...
344 2492 بني سويف ريف 29.029384 31.082819 40.915556 A 25000 817101.511389 1217 792101.511389
345 2507 أسوان حضر 23.979483 32.896105 37.955556 A 25000 817101.511389 1219 792101.511389
346 2509 قنا ريف 26.194544 32.554853 37.046667 A 25000 817101.511389 1220 792101.511389
347 2511 المنيا ريف 28.454434 30.668412 36.696667 A 25000 817101.511389 1221 792101.511389
348 43 المنيا NaN 28.093250 30.814605 NaN B 15000 817101.511389 1227 802101.511389

349 rows × 11 columns

In [146]:
plotDataFrameLongLatWithType(resultDfOPTICSOPT2,showdistance=True,showNumber=True)
In [147]:
print(f"The Total Number of Location Selected:{len(resultDfOPTICSOPT2)}")
print(f"The Total Number of A-Location Selected:{len(resultDfOPTICSOPT2[resultDfOPTICSOPT2['Type']=='A'])}")
print(f"The Total Number of B-Location Selected:{len(resultDfOPTICSOPT2[resultDfOPTICSOPT2['Type']=='B'])}")
print(f"The Total Number of Locations Covered by selected:{len(combinedDfOPTICSOPT2)}")
print(f"The Total Sales Equals:{resultDfOPTICSOPT2['updatedEstimastedSales'].sum()}")
print(f"The Total Coverage Equals:{resultDfOPTICSOPT2['Coverage'].sum()}")
print(f"The Total Costs Equals:{resultDfOPTICSOPT2['operational_costs'].sum()}")
The Total Number of Location Selected:349
The Total Number of A-Location Selected:311
The Total Number of B-Location Selected:38
The Total Number of Locations Covered by selected:708
The Total Sales Equals:238218585.73551106
The Total Coverage Equals:229873585.73551106
The Total Costs Equals:8345000

HDBSCAN

In [148]:
combinedDfHDBSCNOPT2=combinedDFOption2
In [149]:
hdbscn = hdbscan.HDBSCAN(min_cluster_size=2,min_samples=1, cluster_selection_epsilon=5/6371.0, metric='haversine')
y_db = hdbscn.fit_predict(srd.fit_transform(np.radians(combinedDfHDBSCNOPT2[["Longitude","Latitude"]])))
combinedDfHDBSCNOPT2['cluster'] = y_db
plotDataFrameLongLatWithType(combinedDfHDBSCNOPT2,clstr=True)
In [150]:
combinedDfHDBSCNOPT2['updatedEstimastedSales'] = combinedDfHDBSCNOPT2.groupby('cluster')['Estimated_Sales'].transform('sum')
In [151]:
displayRemovedLocations(combinedDfHDBSCNOPT2)
In [152]:
combinedDfHDBSCNOPT2=combinedDfHDBSCNOPT2[combinedDfHDBSCNOPT2["updatedEstimastedSales"]>combinedDfHDBSCNOPT2["operational_costs"]]
In [153]:
combinedDfHDBSCNOPT2
Out[153]:
ID Gov City_Type Latitude Longitude Estimated_Sales Type operational_costs updatedEstimastedSales cluster
0 1 الشرقية حضر 30.727543 31.793701 72021.026667 A 25000 91445.203333 593
1 2 القاهرة حضر 30.235685 31.467567 54690.637778 A 25000 57434.150000 56
2 3 المنوفية حضر 30.361963 30.512031 42379.708889 A 25000 51029.818889 31
3 4 أسوان حضر 24.977806 32.875934 42377.312222 A 25000 62155.374444 101
4 5 مرسى مطروح حضر 31.355745 27.244068 40826.774444 A 25000 46747.784444 28
... ... ... ... ... ... ... ... ... ... ...
2660 125 القليوبية NaN 30.310222 31.323466 NaN B 15000 66996.411789 669
2663 128 القليوبية NaN 30.145500 31.215355 NaN B 15000 31718.676667 674
2664 129 القليوبية NaN 30.318393 31.310492 NaN B 15000 66996.411789 669
2666 131 القليوبية NaN 30.138000 31.220216 NaN B 15000 31718.676667 674
2667 132 الاسماعيلية NaN 30.852917 32.306078 NaN B 15000 31237.639367 42

598 rows × 10 columns

In [154]:
combinedDfHDBSCNOPT2["Coverage"]=combinedDfHDBSCNOPT2["updatedEstimastedSales"]-combinedDfHDBSCNOPT2["operational_costs"]
In [155]:
noiseData=combinedDfHDBSCNOPT2[combinedDfHDBSCNOPT2["cluster"]==-1]
startcount=max(combinedDfHDBSCNOPT2[combinedDfHDBSCNOPT2["cluster"]!=-1]["cluster"])
noiseData['cluster'] = range(startcount, (startcount-1)+len(noiseData) + 1)
selectedPoints = selectPoints(pd.concat([combinedDfHDBSCNOPT2[combinedDfHDBSCNOPT2["cluster"]!=-1], noiseData], ignore_index=True))
indicesWithinDistance=filterRowsWithinDistance(selectedPoints)
resultDfHDBSCNOPT2 = selectedPoints.loc[indicesWithinDistance].reset_index(drop=True)
resultDfHDBSCNOPT2
Out[155]:
ID Gov City_Type Latitude Longitude Estimated_Sales Type operational_costs updatedEstimastedSales cluster Coverage
0 30 شمال سيناء حضر 31.126209 33.808316 25193.351111 A 25000 27797.405556 12 2797.405556
1 41 البحر الاحمر حضر 28.340011 33.086865 21800.420000 A 25000 29658.217778 20 4658.217778
2 10 جنوب سيناء حضر 27.860291 34.303116 34468.973333 A 25000 42950.647778 22 17950.647778
3 77 مرسى مطروح حضر 31.318053 27.287814 15664.434444 A 25000 37672.875556 27 12672.875556
4 5 مرسى مطروح حضر 31.355745 27.244068 40826.774444 A 25000 46747.784444 28 21747.784444
... ... ... ... ... ... ... ... ... ... ... ...
289 2430 الشرقية ريف 30.693489 31.739608 50.696667 A 25000 624781.123611 1042 599781.123611
290 2488 الدقهلية ريف 30.834214 31.263532 41.683333 A 25000 624781.123611 1046 599781.123611
291 2507 أسوان حضر 23.979483 32.896105 37.955556 A 25000 624781.123611 1048 599781.123611
292 2511 المنيا ريف 28.454434 30.668412 36.696667 A 25000 624781.123611 1049 599781.123611
293 43 المنيا NaN 28.093250 30.814605 NaN B 15000 624781.123611 1052 609781.123611

294 rows × 11 columns

In [156]:
plotDataFrameLongLatWithType(resultDfHDBSCNOPT2,showdistance=True,showNumber=True)
In [157]:
print(f"The Total Number of Location Selected:{len(resultDfHDBSCNOPT2)}")
print(f"The Total Number of A-Location Selected:{len(resultDfHDBSCNOPT2[resultDfHDBSCNOPT2['Type']=='A'])}")
print(f"The Total Number of B-Location Selected:{len(resultDfHDBSCNOPT2[resultDfHDBSCNOPT2['Type']=='B'])}")
print(f"The Total Number of Locations Covered by selected:{len(combinedDfHDBSCNOPT2)}")
print(f"The Total Sales Equals:{resultDfHDBSCNOPT2['updatedEstimastedSales'].sum()}")
print(f"The Total Coverage Equals:{resultDfHDBSCNOPT2['Coverage'].sum()}")
print(f"The Total Costs Equals:{resultDfHDBSCNOPT2['operational_costs'].sum()}")
The Total Number of Location Selected:294
The Total Number of A-Location Selected:253
The Total Number of B-Location Selected:41
The Total Number of Locations Covered by selected:598
The Total Sales Equals:147115382.07745552
The Total Coverage Equals:140175382.07745552
The Total Costs Equals:6940000
In [ ]:
FinalResult=pd.concat([optionOneDFResult, optionTwoDFResult],ignore_index=True)
FinalResult
choices # of-Loc # of-A-Loc # of-B-Loc # of-LocCover TotalSales TotalCoverage TotalCosts
0 HDBSCAN 119 80 39 646 5.101435e+06 2.516435e+06 2585000
1 DBSCAN 138 96 42 646 5.701419e+06 2.671419e+06 3030000
2 OPTICS 121 82 39 646 5.181673e+06 2.546673e+06 2635000
3 BallTree 136 94 42 646 5.579939e+06 2.599939e+06 2980000
4 Brute 137 94 43 647 5.601378e+06 2.606378e+06 2995000
5 HDBSCAN-OPT2 294 253 41 598 1.471154e+08 1.401754e+08 6940000
6 DBSCAN-OPT2 53 18 35 169 1.794523e+06 8.195231e+05 975000
7 OPTICS-OPT2 349 311 38 708 2.382186e+08 2.298736e+08 8345000
In [161]:
FinalResult["ProfitMargin"]=FinalResult["TotalCoverage"]/FinalResult["TotalSales"]
FinalResult["CostPerLocation"]=FinalResult["TotalCosts"]/FinalResult["# of-Loc"]
FinalResult["LocationCoverage"]=FinalResult["# of-Loc"]/FinalResult["# of-LocCover"]
FinalResult["EfficiencyMetric"]=FinalResult["ProfitMargin"]*(1/FinalResult["LocationCoverage"])
In [162]:
FinalResult
Out[162]:
choices # of-Loc # of-A-Loc # of-B-Loc # of-LocCover TotalSales TotalCoverage TotalCosts ProfitMargin CostPerLocation LocationCoverage EfficiencyMetric
0 HDBSCAN 119 80 39 646 5.101435e+06 2.516435e+06 2585000 0.493280 21722.689076 0.184211 2.677805
1 DBSCAN 138 96 42 646 5.701419e+06 2.671419e+06 3030000 0.468553 21956.521739 0.213622 2.193373
2 OPTICS 121 82 39 646 5.181673e+06 2.546673e+06 2635000 0.491477 21776.859504 0.187307 2.623918
3 BallTree 136 94 42 646 5.579939e+06 2.599939e+06 2980000 0.465944 21911.764706 0.210526 2.213234
4 Brute 137 94 43 647 5.601378e+06 2.606378e+06 2995000 0.465310 21861.313869 0.211747 2.197487
5 HDBSCAN-OPT2 294 253 41 598 1.471154e+08 1.401754e+08 6940000 0.952826 23605.442177 0.491639 1.938061
6 DBSCAN-OPT2 53 18 35 169 1.794523e+06 8.195231e+05 975000 0.456680 18396.226415 0.313609 1.456207
7 OPTICS-OPT2 349 311 38 708 2.382186e+08 2.298736e+08 8345000 0.964969 23911.174785 0.492938 1.957588

Analysis

Profit Margin: Highest: OPTICS-OPT2 (0.964969) Second Highest: HDBSCAN-OPT2 (0.952826)

Total Coverage: Highest: OPTICS-OPT2 (2.298736e+08) Second Highest: HDBSCAN-OPT2 (1.401754e+08)

Efficiency Metric: Highest: HDBSCAN (2.677805) Second Highest: OPTICS (2.623918)

Cost Per Location: Lowest: DBSCAN-OPT2 (18396.226415) Second Lowest: HDBSCAN (21722.689076)

Conclusion

OPTICS-OPT2 emerges as the best overall choice due to its: Highest Profit Margin (0.964969) Highest Total Coverage (2.298736e+08)

While its Efficiency Metric (1.957588) is lower than that of HDBSCAN and OPTICS, the substantial profit margin and total coverage outweigh this factor. Additionally, its cost per location is relatively high, but this is compensated by the significantly higher coverage and profit margin.

In [163]:
resultDfOPTICSOPT2
Out[163]:
ID Gov City_Type Latitude Longitude Estimated_Sales Type operational_costs updatedEstimastedSales cluster Coverage
0 57 الشرقية NaN 30.733709 31.791613 NaN B 15000 85931.800000 0 70931.800000
1 63 الشرقية NaN 30.722667 31.664328 NaN B 15000 53305.933333 10 38305.933333
2 60 الشرقية NaN 30.750426 31.452409 NaN B 15000 22111.735556 36 7111.735556
3 62 الشرقية NaN 30.523139 31.348689 NaN B 15000 46646.990000 51 31646.990000
4 84 المنوفية NaN 30.549361 31.036855 NaN B 15000 17693.730000 99 2693.730000
... ... ... ... ... ... ... ... ... ... ... ...
344 2492 بني سويف ريف 29.029384 31.082819 40.915556 A 25000 817101.511389 1217 792101.511389
345 2507 أسوان حضر 23.979483 32.896105 37.955556 A 25000 817101.511389 1219 792101.511389
346 2509 قنا ريف 26.194544 32.554853 37.046667 A 25000 817101.511389 1220 792101.511389
347 2511 المنيا ريف 28.454434 30.668412 36.696667 A 25000 817101.511389 1221 792101.511389
348 43 المنيا NaN 28.093250 30.814605 NaN B 15000 817101.511389 1227 802101.511389

349 rows × 11 columns

In [164]:
plotDataFrameLongLatWithType(resultDfOPTICSOPT2,showdistance=True,showNumber=True)
In [ ]: